Hardware-Efficient Attention for Accelerated LLM Decoding

GTA and GLA for Scalable, Low-Latency Language Model Inference

Published

May 27, 2025

Authors: T. Zadouri et al.
Link: http://arxiv.org/abs/2505.21487v1
Institutions: Department of Computer Science, Princeton University • Princeton Language and Intelligence, Princeton University
Keywords: hardware-efficient attention, KV cache, arithmetic intensity, Grouped-Tied Attention (GTA), Grouped Latent Attention (GLA), Multi-head Latent Attention (MLA), Grouped-Query Attention (GQA), tensor parallelism, paged KV, FlashAttention3, speculative decoding, latency, throughput, LLM inference, FineWeb-Edu-100B, NVIDIA H100, memory-bound workload


Large Language Model (LLM) decoding is critically limited by the memory bandwidth needed to load large key-value (KV) caches, particularly as batch sizes and context lengths grow. Modern inference workloads are frequently memory-bound: moving KV-cache data between memory tiers, not computing attention, dominates decode latency, so additional compute and parallelism go underused. Standard Multi-Head Attention (MHA) exacerbates this mismatch, since its low arithmetic intensity (few FLOPs performed per byte of cache loaded) runs against hardware trends in which compute throughput has grown much faster than memory bandwidth.
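To make the memory-bound claim concrete, here is a back-of-envelope arithmetic-intensity calculation for one decode step. The function name and the dimensions (8 query heads, head size 128, fp16) are illustrative assumptions, not values from the paper; the point is only that grouping more query heads onto fewer cached KV heads raises FLOPs per byte moved.

```python
def decode_arithmetic_intensity(n_kv_heads: int, head_dim: int, seq_len: int,
                                n_q_heads: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte of KV cache moved when one new token attends to the cache.

    Illustrative model: QK^T and PV each cost ~2*head_dim FLOPs per
    (query head, cached position); K and V caches are each
    seq_len * head_dim elements per KV head.
    """
    flops = 2 * 2 * head_dim * seq_len * n_q_heads
    bytes_moved = 2 * seq_len * head_dim * n_kv_heads * bytes_per_elem
    return flops / bytes_moved

# With 8 query heads, head_dim=128, fp16, a 4096-token context:
mha = decode_arithmetic_intensity(8, 128, 4096, 8)  # 8 KV heads -> 1.0 FLOP/byte
gqa = decode_arithmetic_intensity(2, 128, 4096, 8)  # 2 KV heads -> 4.0 FLOP/byte
```

In this toy model the intensity reduces to `n_q_heads / n_kv_heads`, which is why shrinking or sharing KV state is the main lever for moving decoding away from the memory-bandwidth wall.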

To address these challenges, this work introduces two attention variants designed to boost hardware efficiency while maintaining model quality:

- Grouped-Tied Attention (GTA): a GQA-style scheme in which each group of query heads shares a single tied key-value state, shrinking the KV cache relative to GQA at comparable quality.
- Grouped Latent Attention (GLA): a parallelism-friendly counterpart to Multi-head Latent Attention (MLA) that groups latent heads so the latent KV cache can be sharded across devices under tensor parallelism rather than duplicated.
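The tied-KV idea behind GTA can be sketched as a toy decode step. This is a minimal illustration under assumed shapes, not the paper's implementation (which involves further details such as how positional information is applied to the tied state); the function and variable names are hypothetical.

```python
import numpy as np

def grouped_tied_attention(q: np.ndarray, kv: np.ndarray, n_groups: int) -> np.ndarray:
    """Toy single-token attention where K and V share one cached tensor.

    q:  (n_q_heads, head_dim)          query for the current token
    kv: (n_groups, seq_len, head_dim)  one tied KV state per query group

    Each group of query heads reads the same tied buffer as both key and
    value, so the cache holds one tensor where GQA would hold two.
    """
    n_q_heads, head_dim = q.shape
    heads_per_group = n_q_heads // n_groups
    out = np.empty_like(q)
    for h in range(n_q_heads):
        g = h // heads_per_group
        k = v = kv[g]                          # tied: same state serves as K and V
        scores = k @ q[h] / np.sqrt(head_dim)  # (seq_len,)
        p = np.exp(scores - scores.max())      # numerically stable softmax
        p /= p.sum()
        out[h] = p @ v
    return out
```

Compared with GQA, the only structural change in this sketch is that `kv` is a single tensor per group instead of separate `k` and `v` caches, which is where the memory saving comes from.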

These innovations yield strong, measurable outcomes:

- In experiments on models trained on FineWeb-Edu-100B, GTA matches GQA's quality while caching roughly half as much KV state per token.
- GLA preserves MLA-level quality while reducing decode latency and improving throughput on NVIDIA H100 GPUs, especially at large batch sizes and long contexts where the KV cache dominates memory traffic.
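A quick cache-size calculation shows where the savings come from. The dimensions here (2 KV groups, head size 128, fp16) are assumed for illustration; the structural point is that tying K and V halves the number of cached tensors per group.

```python
def kv_cache_bytes_per_token(n_state_tensors: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache appended per decoded token."""
    return n_state_tensors * n_kv_heads * head_dim * bytes_per_elem

d = 128
gqa = kv_cache_bytes_per_token(2, 2, d)  # separate K and V caches -> 1024 bytes
gta = kv_cache_bytes_per_token(1, 2, d)  # one tied KV state/group ->  512 bytes
```

Halving the bytes loaded per cached token directly doubles the arithmetic intensity of the memory-bound decode kernel, which is what translates into lower latency at fixed bandwidth.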

Building on these results, the paper concludes that attention can be co-designed with hardware constraints, raising arithmetic intensity and shrinking per-device KV state without sacrificing quality, and it suggests future directions such as integrating these variants with paged KV caching, FlashAttention-3-style kernels, and speculative decoding.